A community effort for COVID-19 Ontology Harmonization

Ontologies have become critical for data and knowledge representation, standardization, integration, and analysis. The SARS-CoV-2 pandemic led to the rapid proliferation of COVID-19 data, as well as the development of many COVID-19 ontologies. To support data interoperability, we initiated a community-based effort to harmonize COVID-19 ontologies. Our effort involves collaborative discussion among developers of seven COVID-19 related ontologies, and the merging of four ontologies. This effort demonstrates the feasibility of harmonizing these ontologies in an interoperable framework to support integrative representation and analysis of COVID-19 related data and knowledge.


Introduction
Despite the development and distribution of effective COVID-19 vaccines, the COVID-19 pandemic remains a challenge to overcome. The sheer volume of data collected by researchers, the speed at which it is generated, the range of its sources, its varying quality and accuracy, and the need to assess its usefulness result in complex, multidimensional datasets [1], often annotated with specific terminologies and coding systems by researchers in distinct disciplines. The value of cross-discipline meta-data analysis is evident in the present pandemic. However, the extensive body of COVID-19 research faces a major challenge in data silos, which significantly undermine interoperability, meta-data analysis, reproducibility, pattern identification, and discovery and reusability across disciplines [2].
Ontologies (interoperable, logically well-defined, controlled vocabularies representing common entities and relations across disciplines) are a well-known solution to data silo problems. Ontologies are widely used in bioinformatics and biomedical data standardization, supporting data integration, sharing, reproducibility, and automated reasoning. To meet the different needs of COVID-19 studies, different groups of ontology developers have worked separately since the start of the pandemic, resulting in the development of several COVID-19 ontologies. A lack of coordination among these groups risks the proliferation of COVID-19 ontologies using distinct, potentially non-interoperable, vocabularies.
The Workshop on COVID-19 Ontologies (WCO-2020), held on Oct. 23 and Oct. 30, 2020, brought together developers from international groups to report their efforts on building COVID-19 related ontologies. To harmonize heterogeneous knowledge and data for better COVID-19 research, the workshop attendees formed a COVID-19 Ontology Harmonization Working Group (WG) and discussed ways to harmonize these related ontologies. This paper reports the current results of the harmonization effort conducted by the WG.

Scope and Methods
In this study, the following seven COVID-19 related ontologies were covered in the ontology harmonization process by the COVID-19 Ontology Harmonization Working Group:

7. Ontology for collection and analysis of COviD-19 data (CODO) [7]

Each of the above ontologies has its own scope and purpose. [5].
The mission statement of the COVID-19 Ontology Harmonization WG is to harmonize different COVID-19 related ontologies to support COVID-19 related data and knowledge interoperability. To achieve the mission, WG members held regular virtual Zoom meetings and communicated through emails. We identified overlapping domains or subdomains from different ontology groups and built consensus on ontology terms needed to characterize specific COVID-19 related entities.

VIDO
VIDO (https://bioportal.bioontology.org/ontologies/VIDO) is an extension of the Infectious Disease Ontology (IDO) designed to bridge IDO, which is composed of terms common to any scientific investigation of infectious disease, to virus-specific ontologies. As such, VIDO follows OBO Foundry guidelines closely. VIDO is composed of terms common to any investigation of viral infectious diseases, including virus classification, virus infection epidemiology, pathogenesis, and treatment. For example, VIDO defines terms such as virus, prion, viricide, and virus infection incidence.

COVoc
Controlled Vocabulary for COVID-19 (COVoc) (https://github.com/EBISPOT/covoc) is an application ontology created in March 2020 as a collaboration between the European Bioinformatics Institute (EMBL-EBI) and the Swiss Institute of Bioinformatics (SIB). Its primary use case is to enable seamless annotation of biomedical literature to core databases and ELIXIR tools (ELIXIR is a Europe-wide intergovernmental organization for life sciences). The ontology covers nine axes related to the COVID-19 pandemic: biomedical vocabulary, cell lines, chemical entities, clinical trials, conceptual entities, diseases and syndromes, geographic locations, organisms, and proteins and genomes. COVoc uses existing OBO ontologies where possible to augment connections to other useful resources such as the COVID-19 Data Portal (https://www.covid19dataportal.org/).

HoIP
Homeostasis imbalance process ontology (HoIP) (https://bioportal.bioontology.org/ontologies/HOIP) focuses on homeostatic imbalances between virus action and innate defense processes and covers the causal relationship of organelle/cellular/organ processes from early stage to clinical manifestation in COVID-19. The design patterns between CIDO and HoIP have now been aligned after shared discussion and communication.

MAxO
Medical Action Ontology (MAxO), launched in the spring of 2020, is a broad ontology that provides a structured vocabulary for medical procedures, interventions, therapies, treatments, and clinical recommendations. MAxO was designed to provide a thorough resource for annotating medical actions to diseases, particularly rare diseases. Given the broad nature of MAxO and the timing of its development, much of the hierarchy was added with a keen awareness of the diagnostics and treatment of SARS-CoV-2. While there are no COVID-19-specific terms, terms like 'ventilation with proning' (MAXO:0000619) and 'clinical RNA detection testing' (MAXO:0000592) were added to annotate COVID-19 clinical data sets. To capture the relationship between treatments and diseases, a new tool, the Phenotypic Observation Explication Tool (POET), was developed to establish relationships between MAxO, Human Phenotype Ontology (HPO), and Mondo Disease Ontology (Mondo) terms. This tool will allow researchers to actively participate in annotating data sets for COVID-19 or for other diseases within their expertise. MAxO annotations and the POET tool will be available on the HPO website (hpo.jax.org) by 2022.

Ontology Overlapping and Term Reuse
The ontology harmonization process began by identifying the scope and development methods of the ontologies covered in this work. We found that, instead of reinventing the wheel, each ontology has imported and reused many terms from other ontologies where possible.
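One way to make this kind of term reuse visible is to group each ontology's terms by their source prefix. The following is a minimal sketch, not the procedure used by the WG: the term lists are illustrative placeholders rather than actual import lists, and the helper names `prefix_of` and `reuse_profile` are hypothetical.

```python
from collections import Counter


def prefix_of(curie: str) -> str:
    """Return the ontology prefix of a CURIE such as 'MAXO:0000619'."""
    return curie.split(":", 1)[0]


def reuse_profile(own_prefix: str, terms: list[str]) -> Counter:
    """Count terms that come from other ontologies, grouped by source prefix."""
    return Counter(p for p in map(prefix_of, terms) if p != own_prefix)


# Illustrative term lists only (assumption): real ontologies import far more terms.
terms_by_ontology = {
    "CIDO": ["CIDO:0000088", "NCBITaxon:2697049", "VO:0000001", "BFO:0000015"],
    "MAXO": ["MAXO:0000619", "MAXO:0000592", "BFO:0000015"],
}

profiles = {
    name: reuse_profile(name, terms) for name, terms in terms_by_ontology.items()
}

# Source ontologies reused by both, i.e. natural anchor points for harmonization.
shared = set(profiles["CIDO"]) & set(profiles["MAXO"])
```

Source ontologies that appear in several profiles are natural starting points when looking for overlapping domains.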

Ontology Alignment and Harmonization
Given that most of the seven ontologies follow the OBO Foundry ontology development principles, such as reusing terms defined in OBO Foundry ontologies, our harmonization exercise found that these ontologies can be aligned under the Basic Formal Ontology (BFO) upper-level ontology (Figure 1). Figure 1 below shows how VIDO, CIDO, IDO-COVID-19, MAxO and HoIP can fit into BFO's structure.
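The idea of aligning terms under a shared upper-level ontology can be sketched as a walk up the subclass hierarchy until a BFO class is reached. This is a minimal illustration under stated assumptions: the `PARENT` map is a toy hierarchy whose placements are assumed for the example and are not taken from the released ontologies, and `bfo_anchor` is a hypothetical helper.

```python
from typing import Optional

# Toy child -> parent map. Placements are assumptions for illustration only.
PARENT = {
    "BFO:process": "BFO:occurrent",
    "BFO:occurrent": "BFO:entity",
    "BFO:material entity": "BFO:independent continuant",
    "BFO:independent continuant": "BFO:continuant",
    "BFO:continuant": "BFO:entity",
    "CIDO:SARS-CoV-2 entry to cell": "BFO:process",
    "HoIP:viral entry into host cell [COVID-19]": "BFO:process",
    "MAxO:ventilation with proning": "BFO:process",
    "VIDO:virus": "BFO:material entity",
    "X:unaligned term": "X:local root",  # hypothetical term with no BFO ancestor
}


def bfo_anchor(term: str, parent: dict[str, str]) -> Optional[str]:
    """Return the nearest BFO ancestor of a term, or None if there is none."""
    seen: set[str] = set()
    current: Optional[str] = term
    while current is not None and current not in seen:
        if current.startswith("BFO:"):
            return current
        seen.add(current)  # guards against accidental cycles in the map
        current = parent.get(current)
    return None


# Terms from different ontologies that share an anchor are candidates for comparison.
anchors = {t: bfo_anchor(t, PARENT) for t in PARENT if not t.startswith("BFO:")}
```

Terms for which the anchor is `None` are the ones that would still need an upper-level placement before they can be compared with the rest.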
The relationship between CIDO and IDO-COVID-19 provides an example of precisely the sort of distinct, overlapping ontology development efforts our working group was created to address. Through this alignment exercise, and having observed that the scope of CIDO appears broad enough to include IDO-COVID-19, our working group decided to incorporate the latter ontology into CIDO. Incorporation of terms from IDO-COVID-19 into CIDO will, moreover, strengthen the logical relationship between CIDO and VIDO, given how closely related VIDO and IDO-COVID-19 are.
The HoIP developers are working on mapping and aligning HoIP with all GO process terms. Concerning harmonization, the HoIP developers have started to compare HoIP's processual entities to those in CIDO. For example, although the labels of 'SARS-CoV-2 entry to cell' (CIDO:0000088) and 'viral entry into host cell [COVID-19]' (HoIP:0037063) are different, the HoIP entity is described using an object property restriction ('has agent' some SARS-CoV2), so it can be mapped to the corresponding CIDO term. Because COVoc is an application ontology, its developers rely on the CIDO developers to create new terms, and COVoc imports and reuses CIDO for its application purposes. At the time of writing, the CODO developers have started to align the current build to BFO as its upper ontology, which increases the possibility of better alignment in the future.
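The mapping described above relies on logical definitions rather than labels. A minimal sketch of that idea follows, assuming that both terms share a normalized superclass and that the CIDO term carries an equivalent 'has agent' restriction; that logical definition is an assumption made for the example, not a statement about the released CIDO. The `ClassDef` structure, the `FILLER_SYNONYMS` table, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassDef:
    curie: str
    label: str
    genus: str               # asserted superclass, normalized to a shared label
    restrictions: frozenset  # (property, filler) pairs, read as "property some filler"


# Spelling variants of the same filler (assumption for illustration).
FILLER_SYNONYMS = {"SARS-CoV2": "SARS-CoV-2"}


def signature(c: ClassDef) -> tuple:
    """Label-independent key: superclass plus normalized restrictions."""
    normalized = frozenset((p, FILLER_SYNONYMS.get(f, f)) for p, f in c.restrictions)
    return (c.genus, normalized)


def candidate_mappings(left: list, right: list) -> list:
    """Pair classes whose logical signatures match, regardless of their labels."""
    index: dict = {}
    for c in right:
        index.setdefault(signature(c), []).append(c)
    return [(l.curie, r.curie) for l in left for r in index.get(signature(l), [])]


hoip_term = ClassDef(
    "HoIP:0037063",
    "viral entry into host cell [COVID-19]",
    "viral entry into host cell",
    frozenset({("has agent", "SARS-CoV2")}),
)
cido_term = ClassDef(
    "CIDO:0000088",
    "SARS-CoV-2 entry to cell",
    "viral entry into host cell",             # assumed superclass
    frozenset({("has agent", "SARS-CoV-2")}),  # assumed restriction
)

mappings = candidate_mappings([hoip_term], [cido_term])
```

Matches found this way are only candidates; each would still need review by the ontology developers before being asserted as an equivalence.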

Discussions
While an ontology creates a common language and reduces the work of mapping, the emergence of multiple ontologies may itself create individual silos. Given the many COVID-19 related ontologies that have been reported, our COVID-19 Ontology Harmonization WG provided a timely effort to collaboratively identify the overlaps between different ontologies and to harmonize the seven ontologies. Currently, the seven ontologies have very different perspectives due to their use cases. Entities within these seven ontologies are defined heterogeneously and described in various ways at various granularities. One should align not only identical URIs but also the meaning (semantics) of the entities. Therefore, it is necessary to carefully investigate and compare entities across ontologies, including their definitions, superclasses, logical restrictions, and related entities. Towards the formal alignment of these ontologies, we plan to clarify and make explicit relationships, such as equivalent class, among the ontologies.
Members of the COVID-19 Ontology Harmonization WG made substantial efforts to characterize SARS-CoV-2 and COVID-19 data in a collaborative, computationally tractable, responsible manner. These ontologies are also being used in different use case studies, supporting productive and interoperable COVID-19 research.
The WG has also recognized many future challenges, such as funding, resource and time commitment, and infrastructure development. The WG members are pleased that willingness to join the harmonization work is high and that more interested parties are joining the effort. The WG aims to continue the collaborative effort to further support active COVID-19 research, leading to enhanced public health.

Figure 1: Hierarchical representation of selected terms from different ontologies that are harmonized under the BFO upper-level ontology. Ontologies that are the focus of this harmonization study are shown in red. Terms from many other ontologies, such as BFO, NCBITaxon, and VO, have been used by our ontologies as well.